6 - PCA, Harmony Integration, and kNN Graphs

Author

CDN team

Published

May 21, 2025

Introduction

This vignette is a hands-on guide to understanding and applying principal component analysis (PCA), k-nearest neighbors (kNN) graph creation, and harmony integration in the context of scRNAseq data analysis. PCA and Harmony are two common methods for creating a dimensionality-reduced “latent space” from scRNAseq cell x gene matrices. “Latent space” is the term used to describe any matrix that can be calculated with a reduced size while still preserving the relationships between samples of the original matrix. PCA creates the latent space by deriving the combinations of genes that describe the greatest variance across samples in the dataset. Harmony creates a latent space that reduces a batch effect by reducing differences between batches within cluster or latent space variables separately. We calculate the final kNN graph of relationships between cells based on a latent space, because the process of creating the graph scales multiplicatively with number of input features.

Vocabulary

Latent Space

Any matrix that can be calculated from a data matrix with a reduced size while still preserving the relationships between samples of the original matrix. Latent means “existing as potential”, as in “latent abilities” - so a latent space is a hidden set of dimensions in the data that could be useful in ways the raw data is not.

Feature

A variable in a sample x variable matrix. In a cell x gene matrix, the gene is the feature. In PCA, the principal components are the features.

Dimensionality Reduction

The computational process of reducing a large data matrix to fewer features. For example, in scRNAseq, PCA is the most popular dimensionality reduction technique. With it we can reduce our features from ~30,000 genes to ~30 principal components, reducing the computational workload of complex analysis tasks, like creating the nearest neighbors graph.

Embeddings

This is another term for a latent space, but used to refer to what we have done to the cells. The two terms are most often interchangeable. This term implies the “embedding” or placement of the cells in a different feature space.

Loadings

This is the term used to describe the feature translation matrix. Which genes are in which PC? The Loadings matrix shows the weights of each gene in each PC. A negative value in this matrix means expression of the gene pushes cells in the negative direction of the PC, a positive value pushes the cell in the positive direction.

Additional Reading

Principal Component Analysis (PCA)

  • A short synopsis of PCA by Josh Starmer, aka StatQuest - This video uses scRNAseq data to explain some ideas behind PCA
  • Understanding PCA by Victor Powell - a visually interactive explanation of the algorithm behind PCA.
  • Jolliffe, I.T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202. read here

Harmony Integration

k-Nearest Neighbors (kNN) Graphs

  • A nice introduction to kNN graphs can be found at kNN Visualization. We recommend this site in general for learning about algorithms in the context of single-cell RNA sequencing!

Key Takeaways

  • PCA and other latent-space algorithms are essential in the scRNAseq workflow for dimensionality reduction. PCA is especially useful because it helps to visualize and interpret major trends and variability in the data. PCA output:
    • a Cell x PC matrix; a new set of variables that combine the most covarying sets of genes in the dataset, for example M phase vs. G1 phase, Immune vs. Epithelial, Sick vs. Healthy.
    • a PC x Gene matrix; the loadings or weights of each gene’s importance in each PC
    • a Variance Explained list; the standard deviation in the dataset captured by each PC
  • kNN graphs define neighborhoods of cells based on similarity. This is the ultimate goal of scRNAseq data processing, because it allows us to identify clusters of cells. kNN output:
    • a Neighborhood x Cell matrix; A matrix showing which cells are in which neighborhoods
  • Harmony integration minimizes batch effects by regularizing batches in any specified latent space. In benchmarking studies, it is often highlighted as the best integration method, balancing the removal of technical noise with the maintenance of known biological signals, like cell cycle. Like PCA, Harmony creates a latent space that can be visualized. Harmony output:
    • a matrix of harmony integrated PCs per cell
  • These methods are not only relevant in single-cell analysis but also applicable across various fields of data science and can be useful for analysis of any complex dataset.

Libraries

### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

if (!requireNamespace("harmony", quietly = TRUE))
    devtools::install_github("immunogenomics/harmony")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("DT", quietly = TRUE))
    install.packages("DT")    

if (!requireNamespace("scales", quietly = TRUE))
    install.packages("scales") 

if (!requireNamespace("ggraph", quietly = TRUE))
    install.packages("ggraph") 

if (!requireNamespace("tidygraph", quietly = TRUE))
    install.packages("tidygraph") 

if (!requireNamespace("ggforce", quietly = TRUE))
    install.packages("ggforce") 

if (!requireNamespace("ggalluvial", quietly = TRUE))
    install.packages("ggalluvial") 

if (!requireNamespace("corrplot", quietly = TRUE))
    install.packages("corrplot") 

if (!requireNamespace("stats", quietly = TRUE))
    install.packages("stats") 

if (!requireNamespace("fastDummies", quietly = TRUE))
    install.packages("fastDummies") 


### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(devtools)
library(harmony)
library(colorBlindness)
library(DT)
library(scales)
library(ggraph)
library(tidygraph)
library(RColorBrewer)
library(scales)
library(colorRamps)
library(ggforce)
library(ggalluvial)
library(corrplot)
library(stats)
library(fastDummies)

set.seed(687)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from the cellxgene portal.

se <- readRDS("../data/se_qc.rds")

Color palette

donor_pal <- c(
    "#66C2A4", "#41AE76", "#238B45", "#006D2C",
    "#41B6C4", "#1D91C0", "#225EA8", "#253494",
    "#FD8D3C", "#FC4E2A", "#E31A1C", "#BD0026",
    "#ad393b", "#800000", "#800050")

names(donor_pal) <- c(
    "T024", "T036", "T44", "T057", "T110", "T160", "T161", "T182",
    "T017", "T019", "T176", "T189", "T197", "T203", "T202"
)

To get up to speed with the previous worksheets, process the data in the same way.

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.))

Analysis

PCA

When we run PCA on our Seurat object using the Seurat package’s implementation, PCA is calculated on the scaled, highly variable genes. The resulting PCA latent space is placed in the reductions slot.

se <- se %>% 
    RunPCA(
        npcs = 30,
        ndims.print = 1:5,
        nfeatures.print = 10
    )

# This is where latent spaces, aka dimensional reductions are stored. If we call it directly, we can see some information about the PCA.
se@reductions$pca
A dimensional reduction object with key PC_ 
 Number of dimensions: 30 
 Number of cells: 22502 
 Projected dimensional reduction calculated:  FALSE 
 Jackstraw run: FALSE 
 Computed using assay: RNA 

Accessing PCA results

The PC x Cell latent space is stored in the cell.embeddings slot. The Gene x PC matrix of gene weights per PC is stored in the feature.loadings slot. The variance explained per PC is placed in the stdev slot

DT::datatable(se@reductions$pca@cell.embeddings)
DT::datatable(se@reductions$pca@feature.loadings)
(se@reductions$pca@stdev) %>% head
[1] 13.706671 10.723800  7.989363  7.505656  6.676632  6.306823

Using PCA to understand variance in your data

By visualizing the top 2 principal components, we can start to understand the major axes of variance of our data. Seurat has built-in functions to graph the PCs and loadings

se@meta.data$nCount_RNA <- as.numeric(se@meta.data$nCount_RNA)
    
# FeaturePlot allows us to plot any numerical value on any latent space representation of the data. Here we can use it to color PCA by number of RNA counts per cell
nCount_RNA_pca <- FeaturePlot(
        se,
        reduction = "pca",
        features = 'nCount_RNA'
        )

nCount_RNA_pca

# DimPlot allows us to color the latent space based on categorical values per cell, like cell type or which sample the cell is from
celltype_pca <- DimPlot(
        se,
        reduction = "pca",
        group.by = 'category'
        ) 

celltype_pca

# We can also plot further PCs by specifying the dimensions to plot
DimPlot(
    se,
    reduction = 'pca',
    group.by = 'category',
    dims = c(2,3)
    )

sample_pca <- DimPlot(
        se,
        reduction = "pca",
        group.by = 'sample_id'
        ) + scale_color_manual(values = donor_pal)

sample_pca

Based on our DimPlot of the first 2 PCs, we can see the major axes of variation in the data are

  • PC1: epithelial (negative) vs. immune/stromal (positive)

  • PC2: immune (positive) vs. mesenchymal (negative)

By visualizing loadings of these PCs, we can see which genes are driving the differences between these groups of cells

VizDimLoadings(
    se, 
    dims = 1:2, 
    reduction = "pca",
    # By setting balanced = TRUE, we see top loadings in each direction of each PC
    balanced = TRUE)

And indeed, we can see that known epithelail genes MEP1A, FABP1, EPCAM are in the negative direction of PC_1, and immune markers like CD79A and MS4A1 are in positive direction of PC_2

What if I used 10, 30, 50, or even 100 PCs?

So far we haven’t even looked beyond PC’s 1 and 2. The PCA algorithm decides on the first PCs first, then moves down to the next. If we recalculate PCA, asking for 10 PCs, we will see that PC_1 and PC_2 are the same, except sometimes when the directions are flipped. The directions of PCs are randomly chosen at the beginning of the algorithm.

If you asked for 100 PCs, you would see the same again for PCs 1 and 2.

The maximum number of PCs is the total number of samples -1. so in this case, it would be nCells - 1. As we increase the number of PCs, we increase the total variance explained by the PCA in the output matrices.

se <- se %>% 
    RunPCA(
        npcs = 50,
        verbose = FALSE
    )

VizDimLoadings(
    se, 
    dims = c(1:2, 20:21, 40:41),
    reduction = "pca",
    # By setting balanced = TRUE, we see top loadings in each direction of each PC
    balanced = TRUE)

But what if we think we are missing out on some of the data by using too few PCs? Seurat has two methods for deciding the number of PCs.

Deciding the number of Principal Components

The most popular method, the Elbow plot method, is the qualitative choice of nPCs based on deminishing returns on percent variance explained by increasing nPCs.

By default, the ElbowPlot shows the standard deviation of each PC. The standard deviation is the square-root of the variance. The total variance in the dataset is stored in se@reductions$pca@misc$total.variance - to obtain the percent variance explained by each PC, we can take the square of the stdev and divide it by the total variance

ElbowPlot(se, ndims = 50)

varExplained <- data.frame(
    varExplained = se@reductions$pca@stdev^(2) / se@reductions$pca@misc$total.variance,
    PC = seq(1:length(se@reductions$pca@stdev))
    )

ggplot(varExplained, 
       aes(x = PC, 
           y = varExplained)) + 
    geom_point() +
    theme_classic() +
    ylab('percent variance explained per PC')

For the purposes of the elbow plot, we can see that the stdev is roughly equivalent to the variance explained per PC.

The Elbow plot is the simplest way to view deminishing returns on variance explained per PC.

Then we simply cut the nPCs off at that point to prevent our PCA from being overly complex, or overfitting the data. One way we can quickly show what noise looks like in our scRNA data is by injecting noise into our dataset.

In standard data science workflows, PCs are included until some threshold of variance explained is reached, often 85%. But if we calculate the total variance explained by each PC up to the cutoff we chose by ElbowPlot, we can see it’s only 20%. This is because scRNAseq data are highly noisy with a great deal of unique variance per cell. We would have to include 100s of PCs to reach 85% variance explained.

And, keep in mind this total variance is calculated based only on the highly variable genes we chose.

varExplained <- data.frame(
    varExplained = se@reductions$pca@stdev^(2) / se@reductions$pca@misc$total.variance,
    PC = seq(1:length(se@reductions$pca@stdev)),
    data = 'subsampled real data')

varExplained$cumulativeSum <- cumsum(varExplained$varExplained)

ggplot(varExplained, 
       aes(x = PC, y = cumulativeSum, color = data)) +
    geom_point() +
    scale_y_continuous(limits = c(0,1)) +
    theme_classic() +
    ylab('cumulative percent variance explained per PC') +
    ggtitle('How much of the total variance in the dataset\ndid we explain with the PCs chosen by our ElbowPlot()?')

Practicing PCA on a subset of the data

For those curious about just how many PC’s we would need to reach that threshold of 85% variance explained, we need to subset our dataset to make visualization and comparisons reasonable.

# select a random sample of 1000 cells
seSub <- subset(se, downsample = 1000) %>%
        NormalizeData() %>%
        FindVariableFeatures() %>%
        ScaleData() %>%
        RunPCA(npcs = 100)

varExplainedsub <- data.frame(
    varExplained = seSub@reductions$pca@stdev^(2) / seSub@reductions$pca@misc$total.variance,
    PC = seq(1:length(seSub@reductions$pca@stdev)),
    data = 'subsampled real data')

varExplainedsub$cumulativeSum = cumsum(varExplainedsub$varExplained)

ggplot(varExplainedsub, 
       aes(x = PC, y = cumulativeSum, color = data)) +
    geom_point() +
    theme_classic() +
    geom_hline(yintercept = 0.85, color = 'red') +
    ylab('cumulative percent variance explained per PC') +
    ggtitle('If we really wanted to include 85% of the variance in our data\n how many PCs would we need?',subtitle = 'in a subset of 1000 cells')

kNN + sNN graphs

Because these data objects can be large, the nearest neighbors graphs are normally stored in the “graphs” slot of the seurat object in a highly compact form that is not readable. We can optionally add these graphs as sets of matrices using the “return.neighbor” parameter:

se <- FindNeighbors(
            se, 
            # We must set return.neighbor = TRUE to be able to access these matrices.
            # I wasn't able to get the shared nearest neighbors graph to save as matrices, but the kNN appears.
            return.neighbor = TRUE,
            k.param = 30
)
# Now we can print which cells are closest to one another as a matrix
se@neighbors$RNA.nn@nn.idx[1:10, 1:10]
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,]    1   396   617  1090 15620  5518   182     3  9481 14898
 [2,]    2 17427 13748 20962 15771 14163   625    26 15491 21014
 [3,]    3    46  4523   103 15963   280   234 12224 12710  5118
 [4,]    4 15859   758  4523   103  5328     3   807   813 14500
 [5,]    5 19375 14124   226   425 20598    80 15755 10060 10283
 [6,]    6   690   485   244   984   687 17706   338   694  2489
 [7,]    7   616  2928  2765  3212   974 14695  2676  3585  2932
 [8,]    8   198   567 17774  2529 13507   904 20478 11117 17858
 [9,]    9  8857  8181 11396  9220 22127  1926  2016  1894 17799
[10,]   10   395   494  1074 18073  1487  2485  1117 11221 17873

The nearest neighbors graphs are stored in the neighbors slot of the seurat object, and are stored as two matrices: - se@neighbors$RNA.nn@nn.dist; the matrix of distances between each cell (rows) and its closest k.param neighbors (columns) - se@neighbors$RNA.nn@nn.idx; the matrix of which cell those closest k.param cells are. But because these are matrices, it also stores a list of the cell names for index <-> cellname translation - se@neighbors$RNA.nn@cell.names

se <- FindNeighbors(
            se, 
            k.param = 30
)

You can imagine that if we try to display 30 lines from each cell connecting to other cells on a dataset of 60,000 cells, we would be trying to view >1,000,000 lines! This not only isn’t computationally feasible for many computers, but it’s also just not readable to the eye.

For educational purposes let’s visualize this graph on a subset of 1000 cells:

seSub <- subset(se, downsample = 1000) %>%
        NormalizeData() %>%
        FindVariableFeatures() %>%
        ScaleData() %>%
        RunPCA(npcs = 30)

seSub <- FindNeighbors(
    seSub,
    k.param = 30,
    return.neighbor = TRUE
              ) 

Here we write a function that takes in a Seurat object with a KNN-graph in se@neighbors$RNA.nn@nn.idx and returns a plot with the edges between cells.

We’ve included a function here to graph the kNN graph

process_and_graph_connectivity(seSub) + 
    labs(title = "K=30 Cell Connectivity in PCA Space", x = "PC_1", y = "PC_2")

We can see here that most cells are connected to those in their immediate vicinity, but cells that are more extreme along the x and y axes are also connected to more distant cells. If we decrease our k.param, this will be less pronounced.

These nearest neighbors are based on the whole PCA latent space - the distances are decided based on the “euclidean distance” between cells in ALL of the dimensions of the PCA.

seSub <- FindNeighbors(
    seSub,
    k.param = 5,
    return.neighbor = TRUE
    ) 

process_and_graph_connectivity(seSub) + 
    labs(title = "K=5 Cell Connectivity in PCA Space", x = "PC_1", y = "PC_2")

Clusters are then calculated based on the shared nearest-neighbor SNN graph. The shared-nearest neighbor graph is a graph where the distances are replaced with the number of shared nearest neighbors between each cell and its nearest neighbors. So the se@neighbors$RNA.snn@nn.idx matrix is the same, showing which cells are the nearest neighbors, but the `se@neighbors$RNA.snn@nn.dist would be an integer number, the number of neighbors that are the same between each pair of cells. The SNN graph is used to make the clustering more robust to outliers.

Let’s see what happens to clusters when we modulate our k.param.

seSub <- FindNeighbors(
    seSub,
    k.param = 20
    ) 

seSub <- FindClusters(seSub)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 1000
Number of edges: 25021

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8957
Number of communities: 13
Elapsed time: 0 seconds
knn20 <- DimPlot(
    seSub,
    group.by = 'seurat_clusters',
    reduction = 'pca') +
    ggtitle('k.param = 20')

seSub <- FindNeighbors(
    seSub,
    k.param = 5
    ) 

seSub <- FindClusters(seSub)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 1000
Number of edges: 7291

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9226
Number of communities: 22
Elapsed time: 0 seconds
knn5 <- DimPlot(
    seSub, 
    group.by = 'seurat_clusters',
    reduction = 'pca') +
    ggtitle('k.param = 5')

(knn5 + knn20)

As we can see, with a higher k.param, there are fewer clusters. This is because the neighborhoods are smaller when we ask for fewer neighbors! However, in random sub-samples of the data, these clusters will be less stable because the neighbors depend on who is kept in. With a smaller k, we might get spurious clusters. Therefore, choosing k is a balance of stability and resolution on rare cell types.

celltype_pca <- DimPlot(
        seSub,
        reduction = "pca",
        group.by = 'category'
        ) 

sample_pca <- DimPlot(
        seSub,
        reduction = "pca",
        group.by = 'sample_id'
        ) + scale_color_manual(values = donor_pal)

(knn5 + knn20) / (celltype_pca + sample_pca)

Even with a smaller k, we are finding multiple clusters within the authors’ overarching “classical monocyte” label. However, we see a reasonable distribution of donors across them.

If we make an Alluvial plot, we can see the differences between our clusters and the author labels

se <- se %>%
    FindNeighbors(k.param = 5) %>%
    FindClusters(resolution = 0.05) %>% # let's use a very small resolution for simplicity's sake
    RunUMAP(
        reduction = 'pca',
        dims = 1:30,
        reduction.name = 'pca_umap')
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 214998

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9907
Number of communities: 9
Elapsed time: 0 seconds
metadata <- se@meta.data
plot_data <- as.data.frame(table(metadata$seurat_clusters, metadata$category))
colnames(plot_data) <- c("Cluster", "category", "Freq")

ggplot(plot_data,
       aes(axis1 = Cluster, axis2 = category, y = Freq)) +
    geom_alluvium(aes(fill = Cluster), width = 0.1) +
    geom_stratum(width = 0.1) +
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_x_discrete(limits = c("Cluster", "Cell Type"), expand = c(0.15, 0.05)) +
    labs(title = "Alluvial Plot from Seurat Clusters to Cell Types",
         x = "Clusters and Cell Types",
         y = "Count") +
    theme_minimal()

Our low resolution clustering has captured the major cell types: 0 and 1 appear to be epithelial cells, 2 is mesenchymal, 3 are B, 4 is T cells, 6 is myeloid and 8 is endothelial. But 7 is a secondary B celltype

I like to immediately plot the distribution of samples per celltype to make sure we are getting useful clusters in our data.

se@meta.data %>%
    count(sample_id, seurat_clusters) %>% 
    group_by(seurat_clusters) %>%
    mutate(composition = n/sum(n)) %>%
    ggplot(aes(x = seurat_clusters, y = composition, fill = sample_id)) + 
        geom_bar(position = 'stack', stat = 'identity') + 
        scale_fill_manual(values = donor_pal)

So we have a good distribution of samples across our normal celltypes, and 3 and 7 seem to disproportionately come from one sample (T036). We can plot a density distribution to show the counts coming from that sample and find that the sample has a lot more data which could introduce more noise.

ggplot(se@meta.data, aes(x=n_counts)) + geom_density() + facet_wrap("~sample_id")

se@meta.data %>%
    count(`inferred state`, seurat_clusters) %>% 
    group_by(seurat_clusters) %>%
    mutate(composition = n/sum(n)) %>%
    ggplot(aes(x = seurat_clusters, y = composition, fill = `inferred state`)) + 
        geom_bar(position = 'stack', stat = 'identity')

Here we can see that clusters 3 and 7 have the most control population and the least non inflamed.

We can get the best idea of what’s going on there with a differential expression, but let’s save that for later.

Assessing Harmony integration

Switching gears a little bit before we get ahead of ourselves making clusters, let’s double check we don’t have any batch effects at play that may cause batch-specific clustering.

Usually our batch effects will show up as some vector of variation across a metadata variable. This means they will usually show up in our PCA as a PC, so we can check for them by correlating our PCs with our metadata!

plot_batch_effect_heatmap <- function(
                                     seurat_object, 
                                     latent_space = 'pca', 
                                     num_pcs = 20, 
                                     metadata_columns_to_include = NULL,
                                     plot_title = "Correlation Heatmap of Metadata and PCA Components") {
    metadata <- seurat_object@meta.data
    metadata <- metadata[, metadata_columns_to_include, drop = FALSE]
    pca_data <- Embeddings(seurat_object, latent_space)[, 1:num_pcs] # pull the PCs we want to check

    categorical_cols <- sapply(metadata, function(col) is.factor(col) || is.character(col))
    
    # convert character columns to factors for one-hot encoding
    metadata[categorical_cols] <- lapply(metadata[categorical_cols], as.factor)

    #1hot without removing redundant dummies
    metadata_encoded <- fastDummies::dummy_cols(
        metadata,
        remove_first_dummy = FALSE,
        remove_selected_columns = TRUE)

    combined_data <- cbind(metadata_encoded, pca_data) # combine metadata and PCA data
    
    # calculate the correlation matrix
    correlation_matrix <- cor(combined_data, use = "pairwise.complete.obs")

    # Subset the correlation matrix to include only metadata and PCA components
    metadata_columns <- colnames(metadata_encoded)
    pca_columns <- colnames(pca_data)
    subset_correlation_matrix <- correlation_matrix[metadata_columns, pca_columns, drop = FALSE]
    
    # Check for any NA values in the subset correlation matrix and replace them with 0
    subset_correlation_matrix[is.na(subset_correlation_matrix)] <- 0
    
    # Create the heatmap using corrplot
    corrplot(subset_correlation_matrix, method = "circle", tl.col = "black", tl.srt = 45,
             number.cex = 0.7, tl.cex = 0.7,
             main = plot_title)
}

plot_batch_effect_heatmap(
    se, 
    metadata_columns_to_include = c('Age',  "sample_id", "n_counts", "Diagnosis", "inferred state")) # 'sex', "disease", "Severity"

We don’t really have any metadata that are correlating with our PCs. So it seems we’ve found a great dataset without much batch effect! Out of all of the PCs here, PC8 shows strongest correlations with datapoints that we probably don’t want to include in scientific discourse - it’s highly correlated with a single patient, and that likely causes a spurious correlation with the metadata this patient has. So, for educational purposes, let’s see what happens when we integrate over patient_id to try to remove this effect.

Integration with Harmony

se <- se %>% 
    RunHarmony(
        group.by.vars = 'sample_id',
        reduction.use = 'pca',
    ) %>%
    FindNeighbors(reduction = 'harmony', k.param =5) %>%
    FindClusters(resolution = 0.05)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 221268

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9908
Number of communities: 10
Elapsed time: 0 seconds
se <- RunUMAP(se, reduction = "harmony", dims = 1:30, reduction.name = "harmony_umap")

h1_2 <- DimPlot(se,
        reduction = 'harmony',
        group.by = 'sample_id',
        dims = c(1,2)
        ) + 
        scale_color_manual(values = donor_pal) +
        NoLegend()

h1_8 <- DimPlot(se,
        reduction = 'harmony',
        group.by = 'sample_id',
        dims = c(1,8)
        ) + NoLegend() +
        scale_color_manual(values = donor_pal)

pcs1_2 <- DimPlot(se,
        reduction = 'pca',
        group.by = 'sample_id') + 
        scale_color_manual(values = donor_pal) +
        NoLegend()

pcs1_8 <- DimPlot(se,
        reduction = 'pca',
        group.by = 'sample_id',
         dims = c(1,8)) + 
        scale_color_manual(values = donor_pal) 

(pcs1_2 + pcs1_8) / (h1_2 + h1_8)

By comparing the PCs directly, we can already see the Flu 1 effect has disappeared from PC_8. Typically we would next want to check the effects of this change on the correlation structure of our whole dataset. We can do this by plotting again the correlation of metadata covariates

plot_batch_effect_heatmap(
    se, 
    latent_space = 'harmony',
    metadata_columns_to_include = c('Age',  "sample_id", "n_counts", "Diagnosis", "inferred state"),
    plot_title = "Correlation Heatmap of Metadata and Harmony latent variables")

I like to immediately plot the distribution of samples per celltype to make sure we are getting useful clusters in our data.

plt1 <- DimPlot(
    se,
    reduction = 'pca_umap',
    group.by = c('sample_id')) + 
    scale_color_manual(values = donor_pal) | DimPlot(
        se,
        reduction = 'pca_umap',
        group.by = c('category'))

plt2 <- DimPlot(
    se,
    reduction = 'harmony_umap',
    group.by = c('sample_id')) + 
    scale_color_manual(values = donor_pal) | DimPlot(
        se,
        reduction = 'harmony_umap',
        group.by = c('category'))

plt1 / plt2

Session Info

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future_1.40.0        fastDummies_1.7.5    corrplot_0.95        ggalluvial_0.12.5    ggforce_0.4.2        colorRamps_2.3.4     RColorBrewer_1.1-3   tidygraph_1.3.1      ggraph_2.2.1         scales_1.3.0         DT_0.33              colorBlindness_0.1.9 harmony_1.2.3        Rcpp_1.0.14          devtools_2.4.5       usethis_3.1.0        lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.4          readr_2.1.5          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.2        tidyverse_2.0.0      Seurat_5.2.1         SeuratObject_5.0.2   sp_2.2-0            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1      jsonlite_2.0.0         magrittr_2.0.3         spatstat.utils_3.1-3   farver_2.1.2           rmarkdown_2.29         fs_1.6.6               vctrs_0.6.5            ROCR_1.0-11            memoise_2.0.1          spatstat.explore_3.4-2 htmltools_0.5.8.1      gridGraphics_0.5-1     sass_0.4.10            sctransform_0.4.1      parallelly_1.43.0      bslib_0.9.0            KernSmooth_2.23-26     htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             plotly_4.10.4          zoo_1.8-14             cachem_1.1.0           igraph_2.1.4           mime_0.13              lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.6-5           R6_2.6.1               fastmap_1.2.0          fitdistrplus_1.2-2     shiny_1.10.0           digest_0.6.37          colorspace_2.1-1       patchwork_1.3.0        tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1          pkgload_1.4.0          crosstalk_1.2.1        labeling_0.4.3         progressr_0.15.1       timechange_0.3.0       spatstat.sparse_3.1-0  httr_1.4.7             polyclip_1.10-7        abind_1.4-8            compiler_4.3.2         remotes_2.5.0          withr_3.0.2           
 [52] viridis_0.6.5          pkgbuild_1.4.7         MASS_7.3-60.0.1        sessioninfo_1.2.3      tools_4.3.2            lmtest_0.9-40          httpuv_1.6.15          future.apply_1.11.3    goftest_1.2-3          glue_1.8.0             nlme_3.1-168           promises_1.3.2         grid_4.3.2             Rtsne_0.17             cluster_2.1.8.1        reshape2_1.4.4         generics_0.1.3         gtable_0.3.6           spatstat.data_3.1-6    tzdb_0.5.0             hms_1.1.3              data.table_1.17.0      spatstat.geom_3.3-6    RcppAnnoy_0.0.22       ggrepel_0.9.6          RANN_2.6.2             pillar_1.10.2          spam_2.11-1            RcppHNSW_0.6.0         later_1.4.2            splines_4.3.2          tweenr_2.0.3           lattice_0.22-7         survival_3.8-3         deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.50             gridExtra_2.3          scattermore_1.2        RhpcBLASctl_0.23-42    xfun_0.52              graphlayouts_1.2.2     matrixStats_1.5.0      stringi_1.8.7          lazyeval_0.2.2         yaml_2.3.10            evaluate_1.0.3         codetools_0.2-20       cli_3.6.4             
[103] uwot_0.2.3             xtable_1.8-4           reticulate_1.42.0      jquerylib_0.1.4        munsell_0.5.1          globals_0.16.3         spatstat.random_3.3-3  png_0.1-8              spatstat.univar_3.1-2  parallel_4.3.2         ellipsis_0.3.2         dotCall64_1.2          profvis_0.4.0          urlchecker_1.0.1       listenv_0.9.1          viridisLite_0.4.2      ggridges_0.5.6         rlang_1.1.6            cowplot_1.1.3